Data preparation

Reading and initial preprocessing

Load the four datasets and bind them into a single data frame, adding a manufacturer column.

library(tidyverse) # purrr, readr, dplyr, forcats, ggplot2
library(glue)

# Read one CSV per manufacturer and tag each row with its manufacturer
df <- map2(
    c("audi", "bmw", "merc", "vw"), # file names
    c("Audi", "BMW", "Mercedes", "Volkswagen"), # manufacturer names
    function(filename, manufacturer) {
        read_csv(glue("./data/{filename}.csv"),
            col_types = "fiififidd"
        ) %>%
            mutate(manufacturer = as_factor(manufacturer)) # add column
    }
) %>%
    bind_rows() # bind the list of data frames into a single one

Draw a reproducible random sample of 5000 rows:

set.seed(19990428)
df <- df %>%
    slice_sample(n = 5000)

Prefix model with the manufacturer, derive age (relative to 2020), convert year and engineSize to factors, and add auxiliary factor variables that bin the numeric ones by quantile:

df <- df %>% mutate(
    model = as_factor(paste0(manufacturer, " - ", model)),
    age = 2020 - year,
    aux_price = cut_number(price / 1000, 4),
    aux_mileage = cut_number(mileage / 1000, 4),
    aux_mpg = cut_number(mpg, 4),
    aux_tax = cut_number(tax, 2),
    aux_age = cut_number(age, 4),
    year = as_factor(year),
    engineSize = as_factor(engineSize)
)

Summary

Summary of numeric variables
Variable     N      Mean        SD     Min       Q1   Median       Q3       Max
price     5000  21571.45  11544.02  1295.0  13990.0  19498.0  26030.0  149948.0
mileage   5000  23054.10  22309.69     1.0   5904.0  16500.0  33297.0  168000.0
tax       5000    123.60     62.56     0.0    125.0    145.0    145.0     570.0
mpg       5000     54.19     18.11     1.1     45.6     53.3     61.4     470.8
age       5000      2.78      2.10     0.0      1.0      3.0      4.0      19.0
Summary of categorical variables
Variable      Level          N     %
transmission  Manual      1784  35.7
              Automatic   1332  26.6
              Semi-Auto   1884  37.7
              Other          0   0.0
fuelType      Petrol      2065  41.3
              Diesel      2860  57.2
              Hybrid        65   1.3
              Other         10   0.2
              Electric       0   0.0
manufacturer  Audi        1072  21.4
              BMW         1106  22.1
              Mercedes    1340  26.8
              Volkswagen  1482  29.6
Note: year, model and engineSize omitted.
Summary of auxiliary factor variables
Variable     Level           N     %
aux_price    [1.29,14]    1254  25.1
             (14,19.5]    1252  25.0
             (19.5,26]    1244  24.9
             (26,150]     1250  25.0
aux_mileage  [0.001,5.9]  1251  25.0
             (5.9,16.5]   1252  25.0
             (16.5,33.3]  1247  24.9
             (33.3,168]   1250  25.0
aux_mpg      [1.1,45.6]   1338  26.8
             (45.6,53.3]  1291  25.8
             (53.3,61.4]  1188  23.8
             (61.4,471]   1183  23.7
aux_tax      [0,145]      3969  79.4
             (145,570]    1031  20.6
aux_age      [0,1]        1888  37.8
             (1,3]        1453  29.1
             (3,4]         871  17.4
             (4,19]        788  15.8

Counting NA values per variable shows that there are no explicit NA in the sample, as the following table reports. Zeros are counted as well, since in tax and engineSize they may stand in for missing values.

Number of missing and zero values per variable
Variable      Missing  Zeros
model               0      0
year                0      0
price               0      0
transmission        0      0
mileage             0      0
fuelType            0      0
tax                 0    152
mpg                 0      0
engineSize          0     13
manufacturer        0      0

Outliers

There are no severe outliers.

Analysis

Determine if the response variable (price) has an acceptably normal distribution. Apply a test to rule out serial correlation.

The following figure shows the QQ plots.

QQ plots


#> Warning in ks.test(., "pnorm", mean = mean(.), sd = sd(.)): ties should not be
#> present for the Kolmogorov-Smirnov test
#> Warning in ks.test(., "pnorm", mean = mean(.), sd = sd(.)): ties should not be
#> present for the Kolmogorov-Smirnov test
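The warnings above come from `ks.test` called with the sample mean and standard deviation as parameters. As a hedged illustration of that kind of check, the sketch below uses simulated right-skewed data, not the car sample. Two caveats apply to the real data as well: repeated prices produce the ties warning, and estimating the parameters from the same data makes the Kolmogorov-Smirnov p-value only approximate. `shapiro.test` is included as a common alternative for samples of up to 5000 observations.

```r
set.seed(10)
# Simulated right-skewed "prices" (log-normal); illustration only
price <- exp(rnorm(1000, mean = 9.8, sd = 0.5))

# Kolmogorov-Smirnov test against a normal with the sample mean and sd
ks_raw <- ks.test(price, "pnorm", mean = mean(price), sd = sd(price))
ks_log <- ks.test(log(price), "pnorm",
    mean = mean(log(price)), sd = sd(log(price))
)
# Shapiro-Wilk test on the log scale
sw_log <- shapiro.test(log(price))

c(ks_raw = ks_raw$p.value, ks_log = ks_log$p.value, shapiro_log = sw_log$p.value)
```

In this simulation the raw values are rejected as normal while the log-transformed values are not, which is the usual motivation for modeling log(price).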

We perform a Durbin-Watson test with the null hypothesis that the autocorrelation of the disturbances is 0. We obtain a p-value of 0.95, so we fail to reject the null hypothesis: there is no evidence of serial correlation.
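For reference, a minimal sketch of how the Durbin-Watson statistic itself is computed, using base R and simulated data rather than the car sample. The helper name `dw_statistic` is hypothetical, and the report does not show which function produced the p-value quoted above.

```r
# Durbin-Watson statistic: sum of squared successive differences of the
# residuals divided by the residual sum of squares.
dw_statistic <- function(fit) {
    e <- residuals(fit)
    sum(diff(e)^2) / sum(e^2)
}

set.seed(1)
# Simulated data with independent errors (illustration only)
sim <- data.frame(x = runif(500))
sim$y <- 2 + 3 * sim$x + rnorm(500)

fit <- lm(y ~ x, data = sim)
dw <- dw_statistic(fit)
dw # values near 2 indicate no first-order autocorrelation
```

The statistic ranges from 0 to 4; values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation.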

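The results of the test are consistent with the visual interpretation of the ACF plot (see note 1) shown in the following figure. All values except the one at lag 33 lie within the 95% confidence band, so there is no evidence of autocorrelation; a single exceedance among more than thirty lags is compatible with chance.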

ACF plot for price


Indicate, by exploring the data, which variables appear to be most associated with the response variable (use only the indicated variables).

Spearman correlation plot


Spearman correlation coefficients
          price  mileage    tax    mpg    age
price      1.00    -0.64   0.39  -0.56  -0.69
mileage   -0.64     1.00  -0.25   0.43   0.85
tax        0.39    -0.25   1.00  -0.59  -0.29
mpg       -0.56     0.43  -0.59   1.00   0.41
age       -0.69     0.85  -0.29   0.41   1.00
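As a reminder of what these coefficients measure, the sketch below shows, on simulated data, that the Spearman coefficient is the Pearson correlation of the ranks. The variables `x` and `y` are illustrative stand-ins, not columns of the car sample.

```r
set.seed(2)
# Simulated monotone but non-linear relationship (illustration only)
x <- rexp(200)
y <- exp(-x) + rnorm(200, sd = 0.1)

rho_builtin <- cor(x, y, method = "spearman")
rho_manual <- cor(rank(x), rank(y)) # Pearson correlation of the ranks

all.equal(rho_builtin, rho_manual)
```

Applied to a data frame of numeric columns, `cor(..., method = "spearman")` returns a matrix with the same layout as the table above.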
Qualitative variable correlation with log(price)
Variable R2
year 0.48
transmission 0.29
engineSize 0.39
aux_mileage 0.34
aux_age 0.42
aux_mpg 0.26
manufacturer 0.10
fuelType 0.01
outlier 0.00
Qualitative categories correlation with log(price)
Variable Estimate p.value
engineSize=6.2 1.92 0.00
engineSize=1.9 -1.86 0.00
year=2001 -1.82 0.00
engineSize=1.7 -1.71 0.00
engineSize=5.2 1.68 0.00
year=2002 -1.35 0.00
engineSize=2.7 -1.34 0.00
year=2020 1.29 0.00
engineSize=6 1.29 0.00
year=2019 1.21 0.00
engineSize=3.7 -1.15 0.04
engineSize=5.5 1.11 0.00
year=2018 1.00 0.00
engineSize=4 0.98 0.00
year=2005 -0.90 0.00
engineSize=1.2 -0.85 0.00
engineSize=4.4 0.84 0.00
year=2017 0.79 0.00
engineSize=2.9 0.77 0.00
year=2007 -0.64 0.00
engineSize=1 -0.63 0.00
engineSize=1.8 -0.63 0.00
year=2016 0.63 0.00
year=2015 0.53 0.00
year=2010 -0.46 0.00
aux_age=[0,1] 0.46 0.00
engineSize=1.6 -0.44 0.00
engineSize=1.4 -0.42 0.00
aux_age=(4,19] -0.42 0.00
year=2009 -0.40 0.00
aux_mileage=(33.3,168] -0.40 0.00
aux_mpg=[1.1,45.6] 0.40 0.00
year=2006 -0.38 0.00
year=2014 0.38 0.00
aux_mileage=[0.001,5.9] 0.36 0.00
transmission=Manual -0.36 0.00
engineSize=3 0.32 0.00
year=2008 -0.29 0.00
engineSize=2.3 0.27 0.04
manufacturer=Volkswagen -0.25 0.00
aux_mpg=(61.4,471] -0.24 0.00
year=2013 0.24 0.00
transmission=Semi-Auto 0.23 0.00
fuelType=Hybrid 0.22 0.00
engineSize=2.1 -0.20 0.02
aux_mpg=(53.3,61.4] -0.16 0.00
aux_mileage=(5.9,16.5] 0.16 0.00
aux_age=(3,4] -0.14 0.00
manufacturer=Mercedes 0.12 0.00
transmission=Automatic 0.12 0.00
aux_mileage=(16.5,33.3] -0.12 0.00
year=2012 0.10 0.00
year=2011 0.09 0.00
outlier=FALSE 0.08 0.00
outlier=TRUE -0.08 0.00
fuelType=Petrol -0.08 0.00
engineSize=1.3 0.07 0.00
manufacturer=Audi 0.07 0.00
engineSize=2 -0.07 0.00
manufacturer=BMW 0.05 0.00
fuelType=Diesel -0.01 0.00
year=2004 0.00 0.00

Define a polytomous factor f.age for the covariate car age according to its quartiles, and argue whether the average price depends on the level of age. Statistically justify the answer.
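A minimal sketch of one way to approach this, on simulated data rather than the car sample. The factor is built with base R `cut` on the quartiles, which is roughly what `cut_number(age, 4)` does for `aux_age` in the data preparation. The choice of tests (Welch one-way ANOVA plus Kruskal-Wallis) is an assumption, not necessarily the method intended by the exercise.

```r
set.seed(3)
# Simulated stand-in for the car sample (illustration only)
sim <- data.frame(age = sample(0:10, 400, replace = TRUE))
sim$price <- 30000 * exp(-0.15 * sim$age) * exp(rnorm(400, sd = 0.3))

# Factor defined by the quartiles of age
breaks <- quantile(sim$age, probs = seq(0, 1, 0.25))
sim$f.age <- cut(sim$age, breaks = breaks, include.lowest = TRUE)

# Welch one-way ANOVA (does not assume equal variances across groups)
welch <- oneway.test(price ~ f.age, data = sim)
# Kruskal-Wallis as a non-parametric check, since price is right-skewed
kw <- kruskal.test(price ~ f.age, data = sim)

c(welch = welch$p.value, kruskal = kw$p.value)
```

Small p-values in both tests would support the claim that the average price differs across age groups.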

Calculate and interpret the anova model that explains car price according to the age factor and the fuel type.


Do you think that the variability of the price depends on both factors? Does the relation between price and age factor depend on fuel type?
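Whether the age effect depends on fuel type is a question about the interaction term. A hedged sketch on simulated data follows; the level names and effect sizes are invented for illustration, and no interaction is built into the simulation.

```r
set.seed(4)
n <- 600
sim <- data.frame(
    f.age = factor(sample(c("new", "mid", "old"), n, replace = TRUE),
        levels = c("new", "mid", "old")
    ),
    fuelType = factor(sample(c("Petrol", "Diesel"), n, replace = TRUE))
)
# Additive effects only: the age effect is the same for both fuel types
sim$log_price <- 10 - 0.4 * (as.integer(sim$f.age) - 1) +
    0.1 * (sim$fuelType == "Diesel") + rnorm(n, sd = 0.3)

additive <- lm(log_price ~ f.age + fuelType, data = sim)
with_inter <- lm(log_price ~ f.age * fuelType, data = sim)

# F test of the nested models: a small p-value would mean that the age
# effect differs by fuel type
cmp <- anova(additive, with_inter)
cmp
```

If the interaction is not significant, the additive model is the simpler description and the main effects can be read separately.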

Calculate the linear regression model that explains the price from the age: interpret the regression line and assess its quality.

Linear regression on price ~ age
Term         Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  29687.28      229.64   129.28    <0.001
age          -2921.89       65.98   -44.28    <0.001
Linear regression on price ~ age: model statistics
Statistic                    Value
Residual standard error      9784.206
Residual degrees of freedom  4998
Multiple R-squared           0.281792
Adjusted R-squared           0.2816483
F-statistic                  1960.987 on 1 and 4998 DF

Linear regression on log(price) ~ age
Term         Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     10.30        0.01  1224.34    <0.001
age             -0.16        0.00   -66.11    <0.001
Linear regression on log(price) ~ age: model statistics
Statistic                    Value
Residual standard error      0.3585727
Residual degrees of freedom  4998
Multiple R-squared           0.4665263
Adjusted R-squared           0.4664196
F-statistic                  4370.785 on 1 and 4998 DF
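To interpret the log(price) line on the original price scale, the rounded coefficients reported above can be back-transformed. The snippet is plain arithmetic on those two reported values; it does not refit the model.

```r
# Coefficients as reported (rounded) in the log(price) ~ age table
intercept <- 10.30
slope <- -0.16

# Multiplicative change in price per additional year of age, in percent
pct_per_year <- (exp(slope) - 1) * 100
# Fitted (geometric-mean) price of a car of age 0, on the original scale
price_age0 <- exp(intercept)

round(pct_per_year, 1) # about -14.8
```

Under this model each additional year of age is associated with a price roughly 15% lower, whereas the price ~ age model implies a constant loss of about 2922 per year.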

What is the percentage of the price variability that is explained by the age of the car?
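The Multiple R-squared values reported above correspond to about 28.2% of the variability of price and about 46.7% of the variability of log(price); the two figures are not directly comparable because the response differs. As a sketch of what that statistic is, the snippet below recomputes it by hand on simulated data (not the car sample).

```r
set.seed(5)
# Simulated stand-in for price and age (illustration only)
sim <- data.frame(age = sample(0:10, 300, replace = TRUE))
sim$price <- 30000 - 2900 * sim$age + rnorm(300, sd = 9000)

fit <- lm(price ~ age, data = sim)
rss <- sum(residuals(fit)^2)                 # residual sum of squares
tss <- sum((sim$price - mean(sim$price))^2)  # total sum of squares
r2_manual <- 1 - rss / tss

all.equal(r2_manual, summary(fit)$r.squared)
```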

Do you think it is necessary to introduce a quadratic term in the equation that relates the price to its age?
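One common way to answer this is a nested F test between the linear and the quadratic model. The sketch below uses simulated prices with exponential depreciation, so curvature is present by construction; it is an illustration of the comparison, not a result for the car sample.

```r
set.seed(6)
sim <- data.frame(age = runif(400, 0, 15))
# Simulated prices with exponential depreciation (illustration only)
sim$price <- 30000 * exp(-0.16 * sim$age) + rnorm(400, sd = 1500)

linear <- lm(price ~ age, data = sim)
quadratic <- lm(price ~ age + I(age^2), data = sim)

# Nested F test: a small p-value favors keeping the quadratic term
cmp <- anova(linear, quadratic)
cmp$`Pr(>F)`[2]
```

A significant quadratic term on the price scale can also be a sign that a transformation such as log(price) would linearize the relationship.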

Are any additional numeric explanatory variables needed to explain the car price? Study collinearity effects.
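The Spearman coefficient of 0.85 between mileage and age reported earlier suggests that collinearity is worth checking. A minimal sketch of the variance inflation factor (VIF) in base R follows, on simulated predictors. The helper `vif_manual` is hypothetical, and reading VIF values above about 5 as problematic is a common rule of thumb rather than something stated in this report.

```r
# VIF of each predictor: 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on the remaining predictors.
vif_manual <- function(data) {
    sapply(names(data), function(v) {
        others <- setdiff(names(data), v)
        fit <- lm(reformulate(others, response = v), data = data)
        1 / (1 - summary(fit)$r.squared)
    })
}

set.seed(7)
n <- 500
age <- rexp(n, rate = 1 / 3)
sim <- data.frame(
    age = age,
    mileage = 8000 * age + rnorm(n, sd = 5000), # strongly tied to age
    mpg = rnorm(n, mean = 54, sd = 10)          # unrelated to the others
)
vifs <- vif_manual(sim)
round(vifs, 2)
```

In the simulation, age and mileage inflate each other's variance while mpg stays close to 1.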

After controlling for the numerical variables, indicate whether the additive effects of the available factors on the price are statistically significant.

Select the best model available so far. Interpret the equations that relate the explanatory variables to the response (rate).

Study the model that relates the logarithm of the price to the numerical variables.

Once explanatory numerical variables are included in the model, are there any main effects from factors needed?

Graphically assess the best model obtained so far.

Assess the presence of outliers in the studentized residuals at a 99% confidence level. Indicate what those observations are.
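A sketch of this check on a simulated model with one planted outlier; the data and the model are illustrative only. The cutoff is the two-sided 99% quantile of the t distribution that externally studentized residuals follow under the model assumptions.

```r
set.seed(8)
n <- 300
sim <- data.frame(age = runif(n, 0, 10))
sim$log_price <- 10.3 - 0.16 * sim$age + rnorm(n, sd = 0.3)
sim$log_price[1] <- sim$log_price[1] + 3 # plant one gross outlier

fit <- lm(log_price ~ age, data = sim)
r_stud <- rstudent(fit) # externally studentized residuals

# Two-sided 99% cutoff: t distribution with n - p - 1 degrees of freedom
cutoff <- qt(0.995, df = df.residual(fit) - 1)
outliers <- which(abs(r_stud) > cutoff)
outliers
```

With 5000 observations, about 50 residuals would exceed this cutoff by chance alone, so the flagged rows are candidates for inspection rather than automatic removals.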

Study the presence of a priori influential data observations, indicating their number according to the criteria studied in class.

Study the presence of a posteriori influential values, indicating the criteria studied in class and the actual atypical observations.
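The two influence questions can be illustrated together. A priori influence depends only on the predictors (leverage), while a posteriori influence also uses the response (for example Cook's distance). The thresholds below, 3p/n for leverage and 4/(n - p) for Cook's distance, are common rules of thumb and are assumptions here, since the criteria studied in class are not stated in this report. The data are simulated.

```r
set.seed(9)
n <- 300
sim <- data.frame(age = runif(n, 0, 10))
sim$log_price <- 10.3 - 0.16 * sim$age + rnorm(n, sd = 0.3)
# Plant one observation with an extreme age and a price far off the trend
sim$age[1] <- 30
sim$log_price[1] <- 10.5

fit <- lm(log_price ~ age, data = sim)
p <- length(coef(fit))

# A priori: leverage (diagonal of the hat matrix)
lev <- hatvalues(fit)
high_leverage <- which(lev > 3 * p / n)

# A posteriori: Cook's distance
cook <- cooks.distance(fit)
influential <- which(cook > 4 / (n - p))

list(high_leverage = high_leverage, influential = influential)
```

An observation can have high leverage without being influential if its response lies on the fitted trend, which is why both criteria are examined.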

Given a 5-year-old car, the rest of numerical variables on the mean and factors on the reference

Summarize what you have learned by working with this interesting real dataset.


  1. Lag 0 is omitted for clarity.